CALLFORIT

Author

Joseph H

1 Introduction

This report is one of 493,845 that I will make, and one of 104,070,413 that could be made.
I “took” the 1.4 TB LinkedIn data that was breached in 2020 and turned it into insights to power my job hunt.
The insights I can share in this report, related to my goals, are:
- Industry-based recruitment trends.
- Company-based workforce timelines.
- Current/past workforce info:
  - Basic info: name, job title, status, social links. I could add geo-location for the profiles that have it, but it would look creepy.
  - Their work periods.
  - Their experience.

2 About me

Salutations! I’m Joseph, a self-taught data analyst, engineer, and scraper.
Despite life’s challenges, my goal remains a remote job, either full or part-time, and having friends to tackle the challenges of this changing world with.
To show my skills and dedication, I made this project that yielded this tailored report.

3 About the project

3.1 How this project came to life

You probably know by now, from my email, that I am hunting for a job.
About a year ago, I scraped contact info from Google Maps to get my first job. Later I scraped contacts from the LinkedIn website, but I could never reach anyone on a “personal level”… you can check how that went here.

Recently, I finally got around to learning SQL because of DuckDB, a database engine that can process larger-than-memory data on a local machine by spilling to disk. Then I remembered the leaked LinkedIn data that I had never been able to process.
Thus my journey started: learn SQL, process the data, and make something out of it.

3.2 The process

The process was done on my local machine, as follows.

3.2.1 Downloaded the leaked data

I downloaded the data from a torrent.
There were around 700 .gz files, each around 280 MB; 196 GB in total.
Each .gz file contains a roughly 2 GB file; 1.4 TB in total.
Each file has multiple lines, and each line is a standalone JSON object; the file as a whole is not valid JSON, it just has one JSON object per line.
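That one-JSON-object-per-line layout is the JSON Lines (NDJSON) format. A minimal stdlib sketch for streaming records out of one of those archives (the file name is illustrative):

```python
import gzip
import json

def iter_records(path):
    """Yield one dict per line from a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)

# Usage (hypothetical file name):
# for record in iter_records("part-001.json.gz"):
#     print(record.get("name"))
```

Streaming line by line like this keeps memory flat no matter how big the extracted file is.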

3.2.2 Processing the weird data

In this phase I created a script that automatically opens an archive, processes the file, and saves it as a Parquet file with a compression level of 22.
I used Python, pathlib, Polars, and a lot of patience.
The process took around 20 minutes per file; in total it took around three weeks (I had to shut down my PC at night). The result was 700 Parquet files, each around 190 MB; 133 GB in total.

3.2.3 Making a relational database

The data in the datasets was nested, especially the “experience” field: it held both a person’s experience and the associated company info. The problem is that the company info gets repeated many times across all datasets.
Making a relational database solves this and makes the exploratory data analysis easier.
The code was split in two:
1. I used Polars to split each of the 700 datasets into mini relational databases.
2. I used DuckDB to merge all the mini relational databases and remove duplicates in some tables, mainly company and university information.

The result was a relational database 73 GB in size; from 1.4 TB down to 73 GB.

All of this ran on my PC, so no servers were harmed, only my CPU fan and my ears.

3.2.4 Filter

I filtered companies based on their industry, country, and whether I have the email of one of the higher-ups.


5 General graphs

5.1 Information technology and services industry’s yearly new recruit count

5.2 Callforit’s workforce status over the years

6 Workforce sample

6.1 Hadi Mohammed

Job title: Business owner
Associated: True
Socials: https://linkedin.com/in/hadi-mohammed | https://linkedin.com/in/hadi-mohammed-52808b43

6.1.1 Hadi Mohammed’s working period at callforit

6.1.2 Gantt plot of Hadi Mohammed’s experience